Abstract. Marginal structural models were introduced in order to provide
estimates of causal effects from interventions based on observational studies in 
epidemiological research. The key point is that this can be understood in terms 
of Girsanov's change of measure. This offers a mathematical interpretation of 
marginal structural models that has not been available before. We consider 
both a model of an observational study and a model of a hypothetical ran- 
domized trial. These models correspond to different martingale measures, the 
observational measure and the randomized trial measure, on some underlying 
space. We describe situations where the randomized trial measure is abso- 
lutely continuous with respect to the observational measure. The resulting 
continuous time likelihood ratio process with respect to these two probabil- 
ity measures corresponds to the weights in discrete time marginal structural 
models. In order to do inference for the hypothetical randomized trial, we can 
simulate samples using observational data weighted by this likelihood ratio. 



1. Introduction 

We will consider the following scenario: A patient has a disease. In order to
avoid an event such as death, a specific treatment can be given. The given
treatment will typically depend on the patient's previous health condition. 

We would like to estimate the effect of a given treatment on the time to the 
occurrence of the event. A natural way to do so is to implement some sort of 
randomized trial. This means that we would have to set up an experiment on a 
group of patients where the treatment was initiated by randomization independently 
of each patient's previous health condition. Such a study typically requires a lot
of resources and may not be available. In order to take advantage of another 
type of data, we could try to base our estimates of the treatment effect on an 
observational study. Suppose we have observations of a group of patients where 
the given treatments were chosen by doctors. As a first attempt, one could try to 
compute the relative short-term risk between the group given treatment and the 
group not given treatment at a time. This could for instance be done using Cox 
proportional hazards regression techniques. However, such a naive analysis would 
most likely introduce a bias compared to the estimate based on the randomized trial. 
The reason is that the health condition of the patient not already on treatment will 
be a predictor of both treatment and death, i.e. it is likely to be a confounder
[SHL+05].

We can easily imagine two opposite scenarios where this confounder would com- 
plicate estimates: Due to considerable costs, reduced life quality or possibly drug 
resistance, one could decide that the treatment should not be initiated until the 
patients are sufficiently ill. A naive marginal analysis based on data from an ob- 
servational study would then quickly lead us to believe that the treatment effect 
was less than the true treatment effect. Conversely, if we for some reason decided 
only to initiate treatment for patients with a good health condition and not for the 
ones with a poor condition, then a naive marginal analysis would quickly lead us 
to believe that the treatment effect was better than the true treatment effect. 
In order to solve this problem, one might suggest computing an estimate of the
treatment effect conditionally on the health condition of the patient. However, in 
several situations, it is likely that the previous treatment will improve the patient's 
general intermediate health condition. This improvement will in itself typically 
postpone the time of death. The conditional effect estimate we described would only 
incorporate the direct treatment effect, not the effect that is due to an improvement 
of the patient's intermediate health condition. 

There is also another source of bias that we have to consider in order to lay hands
on the causal effect of treatment, namely censoring. We assume that a patient may
drop out of the study at some time and not return, i.e. we have right censoring. The
given treatment, calendar time and the patient's health condition might lead to
such a drop-out. If we do a naive analysis based on the patients that are still in
the study, then we introduce a selection bias [HHDR04].

We are forced to move outside the standard Cox regression framework since 
we have to deal with the mentioned time dependent confounder effects due to a 
patient's underlying health condition. In order to provide a meaningful estimate 
of the treatment effect, with a simple interpretation, we could try to construct a 
rich model that also describes the dynamics of the underlying biological processes. 
Such mechanisms are likely to be very complicated and there might not be sufficient 
knowledge or data available. For this reason we could try to fit a marginal model 
of a suitable randomized trial for our scenario. This will be our strategy in what 
follows. 

One attempt to provide a marginal estimate of the causal treatment effect this 
way is due to J. Robins and is presented in [RHB00]. This method uses marginal
structural models and relies on the additional assumption that there are no un- 
measured confounders, i.e. there does not exist an unobserved process that is a 
predictor of both censoring and treatment, both censoring and event or both treat- 
ment and event, given the observed covariates. If every such process is measured, 
then the MSM approach provides a proper adjustment of the marginal effect esti- 
mates. The idea is to apply some clever weights to the observations. This weighting 
results in a pseudo population that is different from the observed population. The 
key property of this pseudo population is that the selection bias and the treatment 
confounding due to the patient's health condition become negligible. Now, one can 
for instance proceed with a weighted Cox regression to obtain a marginal estimate 
of the effect of treatment. The method has been used several times in epidemiological
studies. In [HBR00] it was used to estimate the effect of zidovudine on the
survival of HIV-positive men in the Multicenter AIDS Cohort Study. Moreover,
the method was also used in [SHL+05] to give an estimate of the hazard ratio for
the effect of highly active antiretroviral therapy (HAART) on progression to AIDS or
death for HIV patients in Switzerland.

The method introduced by J. Robins deals with longitudinal data in discrete 
time. We will consider continuous time versions of the marginal structural models 
for event history data. The idea is to characterize reasonable models of a random- 
ized trial, the randomized trial measures, using martingale theory. This offers a 
mathematical interpretation of marginal structural models that has not been avail- 
able before. 

We characterize a class of reasonable models of randomized trials in terms of
local independence. Such a model corresponds to a particular martingale measure. 
The continuous time likelihood ratio process between this measure and the obser- 
vational probability measure corresponds to the weights in a discrete time marginal 
structural model. In order to do inference for this new measure, we can simulate 
samples using the observed data weighted by this likelihood ratio. 
Another approach to causal inference within our scenario is to use the so-called
structural nested models. These models were also introduced by J. Robins, see
[Rob92] and [Rob98]. J. Lok has developed continuous time versions of such models,
using counting processes and martingale theory, see [Lok08].



2. Observable processes and local independence

Before we come to the main results, we will spend some time establishing terminology.
Even if the mathematics involved is fairly standard stochastic process
theory, it is perhaps not so commonly used in event history analysis. A very good
background reference on stochastic processes, that we will use many times, is [JS03].

2.1. Observable processes. In Section 3 we will consider a stochastic model of a
single patient. There are typically many factors that are important for describing
how the disease of that individual develops in time. We will consider models where
all the possible observations of one patient are represented by stochastic integrals
against Poisson processes. More formally, let $d, n \in \mathbb{N}$ and consider a probability
space $(\Omega, \mathcal{F}, Q)$ with mutually orthogonal counting processes $N^1, \dots, N^n$ on the
interval $[0, T]$ and a filtration $\{\mathcal{F}_t\}_t$ that is generated by their joint history and
some initial information $\mathcal{F}_0$. The counting processes are assumed to be Poisson
processes in the sense that

$$\tilde{N}^1_t := N^1_t - t, \quad \dots, \quad \tilde{N}^n_t := N^n_t - t$$

define $Q$-martingales, the compensated Poisson processes. The probability measure
$Q$ will only play a role as a reference measure, as we will mainly be interested in
probability measures that are absolutely continuous with respect to $Q$. The measure
$Q$ will sometimes be referred to as the Poisson measure.

We let $H$ be a bounded and $\mathcal{F}_t$-predictable $d \times n$-matrix valued process and let
$X_0$ denote a bounded $\mathcal{F}_0$-measurable random vector. Now, define the $d$-dimensional
observable process:

$$X_t := X_0 + \int_0^t H_s \, dN_s.$$

All the possible observations of a patient in our approach will be processes of this
form. Counting processes are trivially included in this class, but we also allow
slightly more complicated jump processes. One example could be measurements
of blood values. Each update of the blood value would be given by a jump
and corresponds to a jump time of the underlying counting process. The size and
direction of the jump would then be given by the value of the predictable integrand
$H$ at the jump time.

2.2. Separability. We will say that two observable processes $X$ and $Y$ are separable
if they allow the representations:

$$X_t = X_0 + \int_0^t H^X_s \, dN^X_s \quad \text{and} \quad Y_t = Y_0 + \int_0^t H^Y_s \, dN^Y_s,$$

where $N^X$ and $N^Y$ are independent components of the multivariate process $N$, $X_0$
and $Y_0$ are bounded $\mathcal{F}_0$-measurable random vectors and $H^X$ and $H^Y$ are bounded
matrix valued processes that are predictable with respect to the histories of $N^X$
and $N^Y$ respectively. Separability is a technical assumption that provides well
behaved factorizations of likelihoods. This is used in the proof of Theorem 1.
Heuristically, it means that the processes $X$ and $Y$ reflect different random
phenomena. Separability is even stronger than orthogonality since the processes
are independent with respect to the Poisson measure Q. However, since we will deal 
with other probability measures that are absolutely continuous with respect to the 
Poisson measure, separable processes can not necessarily be treated as independent. 
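
As an illustration of the observable processes of Section 2.1, the following is a minimal sketch in the scalar case $d = n = 1$ under the Poisson measure $Q$. The function names and the "blood value" update rule are hypothetical and chosen only for illustration; the integrand is allowed to depend on the time and on the value of the path strictly before the jump, which mimics predictability.

```python
import random


def poisson_jump_times(horizon, rng):
    """Jump times of a unit-rate Poisson process on [0, horizon]."""
    times = []
    t = rng.expovariate(1.0)
    while t <= horizon:
        times.append(t)
        t += rng.expovariate(1.0)
    return times


def observable_path(x0, jump_times, integrand):
    """Values of X_t = x0 + sum over jumps T_k <= t of H(T_k, X_{T_k-}).

    `integrand(t, x_before)` plays the role of the predictable process H:
    it only sees the time and the value of the path just before the jump.
    Returns a list of (time, value) pairs, starting at time 0.
    """
    path = [(0.0, x0)]
    x = x0
    for t in jump_times:
        x = x + integrand(t, x)
        path.append((t, x))
    return path


rng = random.Random(0)
times = poisson_jump_times(10.0, rng)

# A counting process is the special case H = 1.
counting = observable_path(0.0, times, lambda t, x: 1.0)

# Hypothetical blood value: every update moves the value halfway towards 5.0,
# so the jump sizes are bounded.
blood = observable_path(2.0, times, lambda t, x: 0.5 * (5.0 - x))
```

Two such paths driven by the same jump times are not separable in the sense of Section 2.2; separable processes would be driven by different components of $N$.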

2.3. A martingale measure. As we have mentioned earlier, our samples will 
consist of paths of observable processes. These samples will be distributed according 
to some probability measure $P$ such that a given family of predictable and non-negative
processes define the jump intensities for $N^1, \dots, N^n$ with respect to $\mathcal{F}_t$.
Since we assume that the observations are distributed according to $P$, we will refer
to such a measure as an observational measure.

More formally, we let $\lambda^1, \dots, \lambda^n$ be non-negative $\mathcal{F}_t$-predictable processes and
we assume that $P$ is a probability measure such that:

(1) $P$ and $Q$ coincide on $\mathcal{F}_0$,

(2) $P \ll Q$, i.e. $P$ is absolutely continuous with respect to the Poisson measure,

(3) The equation

$$M^i_t := N^i_t - \int_0^t \lambda^i_s \, ds$$

defines a square integrable $P$-martingale with respect to $\mathcal{F}_t$ for every $i$.

These properties characterize the probability measure $P$ uniquely if such a measure
exists, [JS03, Theorem III.1.26].

2.4. Non-influence. We will need a notion of non-influence between observable 
processes. There are several formal definitions that are meant to capture this, 
see [FF96]. Independence, or even conditional independence, is too strong to be
of interest for the method we have in mind. The non-influence relation we will 
consider is local independence. Heuristically, a process X is locally independent 
of a process Y if information about the past of Y does not contribute to a better 
prediction of the short term behavior of X. 

In the setting of event history analysis, this concept has been studied thoroughly
by V. Didelez [Did08]. T. Schweder [Sch70] used this concept in a study of composable
Markov processes. O. Aalen et al. made use of local independence in order
to study the effect of menopause on the risk of developing a certain skin disease in
[AKT80].

2.5. Local independence. Let $X$, $Y$, $Z$ be observable processes that are mutually
separable. The processes $X_t - X_0$, $Y_t - Y_0$ and $Z_t - Z_0$ are obviously independent
with respect to the probability measure $Q$. However, the situation is typically more
complex with respect to the measure $P$, since the jump intensities $\lambda^1, \dots, \lambda^n$ could
depend on all the information in $\mathcal{F}_{t-}$. We therefore introduce the following concept:

Definition 1. Let $\mathcal{F}^{X,Y,Z}_t$ denote the filtration generated by $N^X$, $N^Y$, $N^Z$ and let
$\mathcal{F}^{X,Z}_t$ denote the filtration generated by $N^X$ and $N^Z$. We say that $X$ is locally
independent of $Y$, given $Z$, if there exists an $\mathcal{F}^{X,Z}_t$-predictable process $\mu$ such that

$$N^X_t - \int_0^t \mu_s \, ds$$

defines a local $P$-martingale with respect to $\mathcal{F}^{X,Y,Z}_t$. If this is the case, then we
write:

$$Y \nrightarrow X \mid Z.$$
2.6. Independent censoring. Local independence generalizes a much used concept
in event history analysis, namely independent censoring. Suppose we can follow
a group of individuals in a clinical trial. We would like to compute the probability
for an individual to survive longer than time $t$. However, an individual might be
censored at some time before the event due to the end of the study or a "drop-out".
Inference is much simpler if the censoring does not influence the instantaneous risk
of the event. Therefore it is common to assume independent censoring. This means
that an individual at risk has the same instantaneous risk of an event as he would
in the situation without censoring. More formally, this means that if $T_D$ is the time
of the event and $T_C$ is the time of censoring then the compensator of the process
$D_t := I(t \geq T_D)$ with respect to the joint event and censoring history only depends
on the event history. This is essentially the same as saying that $D$ is locally
independent of the process defined by $C_t := I(t \geq T_C)$.

2.7. Local independence before a stopping time. Sometimes we may not be
interested in dependencies that are considered trivial. This could for instance be
dependencies due to an absorbing state such as death. We will see that we
can rule out such trivial dependencies if we consider local independence before a
stopping time $\tau$.

Definition 2. Let $\tau$ be an $\mathcal{F}_t$-adapted stopping time. We say that $X$ is locally
independent of $Y$ before $\tau$ and given $Z$ if there exists an $\{\mathcal{F}^{X,Z}_t\}_t$-predictable process
$\mu$ such that

$$N^X_{t \wedge \tau} - \int_0^{t \wedge \tau} \mu_s \, ds$$

defines a local $P$-martingale with respect to $\{\mathcal{F}^{X,Y,Z}_t\}_t$. If this is the case, then we
write:

$$Y \nrightarrow_\tau X \mid Z.$$

If we let $\tau$ denote the time of the first jump of $N^Y$ then it is not very hard to
see, using for instance the explicit representation of $\mathcal{F}^{X,Y,Z}_t$-predictable processes
in [Bre81, Theorem A.2], that for every $\mathcal{F}_t$-predictable process $\gamma$ there exists an
$\mathcal{F}^{X,Z}_t$-predictable process $\tilde{\gamma}$ such that $\gamma_S \cdot I(S \leq \tau) = \tilde{\gamma}_S \cdot I(S \leq \tau)$ $P$ a.s. for every
$\mathcal{F}_t$-adapted stopping time $S$. This means that $Y \nrightarrow_\tau X \mid Z$, i.e. stopping at the first
jump of $N^Y$ rules out every local dependence on $Y$.

2.8. Local independence graphs. V. Didelez also considered graphical models
based on local independence, see [Did08]. These graphs will prove to be very useful
for representing complex models.

Definition 3. We say that a directed graph $G = (E, V)$ is a local independence
graph if the vertices correspond to observable processes that are mutually separable
and such that

$$(X, Y) \notin E \implies X \nrightarrow Y \mid V \setminus \{X, Y\}.$$
Several examples of such graphs will appear below. 
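
To make Definition 3 concrete, the sketch below stores a directed graph and lists the local independence statements that its missing edges encode. The class name, the method name and the three-vertex example graph are hypothetical and are not taken from the models considered later; an edge `(x, y)` is read as an arrow from `x` to `y`.

```python
from itertools import permutations


class LocalIndependenceGraph:
    """Directed graph whose missing edges encode local independences."""

    def __init__(self, vertices, edges):
        self.vertices = set(vertices)
        self.edges = set(edges)
        for x, y in self.edges:
            if x not in self.vertices or y not in self.vertices:
                raise ValueError(f"edge ({x}, {y}) uses an unknown vertex")

    def local_independences(self):
        """Return triples (x, y, rest) for every missing edge x -> y.

        Each triple stands for the statement that x does not locally
        influence y, given the remaining vertices `rest`.
        """
        statements = []
        for x, y in permutations(sorted(self.vertices), 2):
            if (x, y) not in self.edges:
                rest = frozenset(self.vertices - {x, y})
                statements.append((x, y, rest))
        return statements


# Hypothetical example: X -> Z -> Y, with no other arrows.
graph = LocalIndependenceGraph({"X", "Y", "Z"}, {("X", "Z"), ("Z", "Y")})
statements = graph.local_independences()
```

In this example the graph has two of the six possible arrows, so four local independence statements are returned, among them that X does not locally influence Y given Z.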

3. Models of clinical trials 

3.1. A patient model. We will now describe a model of a single patient that participates
in a clinical study. We suppose that $N$ is of the form $(N^A, N^C, N^D, N^L)$
where $N^A$, $N^C$, $N^D$ are univariate counting processes and $N^L$ is a multivariate
counting process. These counting processes count various events that are impor- 
tant for the development of the disease. 
3.1.1. The event process. We let $T_D = \inf\{t > 0 \mid N^D_t = 1\}$ and let

$$D_t := \int_0^t I(s \leq T_D) \, dN^D_s$$

be the event process. It jumps from 0 to 1 at the time the event occurs. The event
could be death or for instance progression to AIDS for an HIV patient.

3.1.2. Measurements of the underlying biological process. The state of an underlying
biological process reflecting the patient's health condition at time $t$ is given by

$$L_t := L_0 + \int_0^t H^L_s \, dN^L_s,$$

where $L_0$ is a bounded $\mathcal{F}_0$-measurable random vector and $H^L$ is a matrix-valued,
bounded and $\mathcal{F}_t$-predictable process. The process $L$ could for instance be measurements
of various blood values.

3.1.3. Right censoring. We assume that the patient can be right censored, i.e. we
will not be able to observe the patient after some stopping time $T_C$. This can
happen because the study ends, but it can also be a "drop-out" due to poor health
or recovery. We assume that $T_C := \inf\{s > 0 \mid N^C_s \neq 0\}$ and define the censoring
process

$$C_t := \int_0^t I(s \leq T_C) \, dN^C_s.$$

3.1.4. The treatment process. One can switch between two treatments of the patient
at the stopping time $T_A$. This could typically be to initiate treatment for a patient
at risk. We let $T_A := \inf\{s > 0 \mid N^A_s \neq 0\}$ and define the treatment process

$$A_t := \int_0^t I(s \leq T_A) \, dN^A_s.$$

In particular, this means that the patient will not initially be on treatment. This
somewhat limiting assumption can be dropped, but then the considerations around
the hypothetical randomized trial at baseline will be much more involved.

3.2. Local independences with respect to the observational measure. The
process $D$ influences the other processes. However, we consider these dependencies
as trivial. We will consider local independence before $T_D$, because then we will
automatically have that $D \nrightarrow_{T_D} A \mid C \cup L$, $D \nrightarrow_{T_D} C \mid A \cup L$ and $D \nrightarrow_{T_D} L \mid A \cup C$.

We also assume that the censoring does not carry any information about the
short term behavior of the other processes that we would not obtain if we left $C$ out
of the analysis. In terms of local independence this means that $C \nrightarrow_{T_D} A \mid D \cup L$,
$C \nrightarrow_{T_D} D \mid A \cup L$ and $C \nrightarrow_{T_D} L \mid A \cup D$. We summarize these local independences in
the following local independence graph:

(3.1) [Figure: local independence graph on the vertices $A$, $C$, $L$ and $D$]
3.3. Randomized trial measures. Our ultimate goal is to provide estimates of
the causal effect of a particular treatment based on observations of patients in an 
observational study. 

Hypothetically, one could carry out some randomized trial where the given treat- 
ment did not depend on the previous health condition of the patient. If we had 
observations from such a trial, we could easily provide simple estimators for the 
causal treatment effect that would not require information about the underlying 
biological mechanisms. This is however not the case for us, so, based on obser- 
vations from the observational study, we will try to simulate a counterfactual or 
hypothetical randomized trial. We assume that we have measurements of all the 
relevant processes and variables. In particular, we assume that the process L is complete
in the sense that it gives rise to every event that affects both the short term
behavior of the treatment and the event, both the censoring and the event or both 
the censoring and the treatment, given the full covariate history. This assumption 
means that all the confounder processes are measured and is usually referred to as 
no unmeasured confounders. 

In order to provide causal interpretation of simple estimators, we should at least 
require the hypothetical trial to satisfy the following: 

(1) Both the underlying biological process and the event process should dy- 
namically behave in the same way in the counterfactual trial and the ob- 
servational study, given the full covariate history, 

(2) One should not allow drop-out due to poor health or recovery, i.e. the 
censoring should not be directly affected by the underlying health process, 
given the event and treatment history, 

(3) Since we consider time dependent treatments, we have to generalize the no- 
tion of a randomized trial slightly. In our counterfactual trial, the patient's 
previous health condition or censoring should not be relevant for the short 
term behavior of the treatment process. Heuristically, this means that the 
randomization should act locally in time. 

The counterfactual trial corresponds to a probability measure $\tilde{P}$ on the space
$(\Omega, \mathcal{F})$. We will refer to such a measure as a randomized trial measure. It carries
the frequencies of the potential observations in the counterfactual randomized trial.
The above requirements can now be translated into the following:

(1) The process $M^i$ is a local $\tilde{P}$-martingale with respect to the filtration $\{\mathcal{F}_t\}_t$
for every $i \in D \cup L$. This means that both the processes $N^L$ and $N^D$
have the same intensity with respect to the randomized trial measure $\tilde{P}$ as
with respect to the observational measure $P$. Moreover, we assume that
the observational measure and the randomized trial measure coincide at
baseline, i.e.

$$E_{\tilde{P}}[H] = E_P[H]$$

for every bounded $\mathcal{F}_0$-measurable random variable $H$.

(2) The censoring should be locally independent of the underlying health process
$L$ with respect to $\tilde{P}$, given the event and treatment history,

(3) The treatment process should be locally independent of the underlying
health process $L$ and censoring $C$ with respect to $\tilde{P}$, given the event history.

We summarize the local independence structure with respect to the randomized
trial measure $\tilde{P}$ in the following local independence graph:
A construction of reasonable randomized trials is given in Theorem 2. Before
we come to that, we will consider censoring in the counterfactual trial. In order to
estimate the total treatment effect, we will consider a marginal model where $L$ is
unobserved. One natural choice of effect measure in the hypothetical experiment
could be the hazard of the event process with respect to the filtration $\{\mathcal{F}^{A,D}_t\}_t$. In
order to estimate the hazard with respect to this filtration, we could try to estimate
the hazard of the event before censoring with respect to the filtration $\{\mathcal{F}^{A,C,D}_t\}_t$.
If in addition, the event process was locally independent of the censoring, given
the treatment process, then these hazards would coincide before censoring, i.e. we
would have independent censoring. In particular, this would imply that the censoring
would not cause bias in the sense that if we did not pay attention to the underlying
biological process then the hazard of the event would not depend on whether the
patient had been censored or not. We will see in the next theorem that this is the
case for the randomized trial measures.

Theorem 1. If $P$ is a randomized trial measure then we have $C \nrightarrow_{T_D} A \cup D$, i.e.
we have independent censoring in the marginalized model without $L$. This gives the
following local independence graph:

[Figure: local independence graph on the vertices $C$, $D$ and $A$]



Proof. The likelihood ratio process

$$S_t := \frac{dP|_{\mathcal{F}_t}}{dQ|_{\mathcal{F}_t}}$$

is a $Q$-martingale with respect to $\mathcal{F}_t$. This is shown in [JS03, Theorem III.3.4]. The
$n$-dimensional Poisson process $N$ has the martingale representation property with
respect to the filtration, see [JS03, Theorem III.4.37], so there exist predictable
processes $u^1, \dots, u^n$ such that:

$$S_t = 1 + \sum_{i=1}^n \int_0^t u^i_s \, d\tilde{N}^i_s,$$

where $\tilde{N}^i_s := N^i_s - s$.
Now, let

$$\mu^i_s := 1 + \frac{u^i_s}{S_{s-}} I(S_{s-} > 0)$$

and note that

$$(3.2) \quad 1 + \sum_{i=1}^n \int_0^t S_{s-} (\mu^i_s - 1) \, d\tilde{N}^i_s = 1 + \int_0^t I(S_{s-} > 0) \, dS_s = S_t, \quad Q \text{ a.s.}$$

The last equality follows from [JS03, Lemma III.3.6].

Let $M^{(i)}_t := N^i_t - \int_0^t \mu^i_s \, ds$ and note that since $\Delta N^i$ is bounded, [JS03, Lemma
III.3.14] says that the quadratic (co)variation process $[\tilde{N}^i, S]$ has locally integrable
variation, so its compensator $\langle \tilde{N}^i, S \rangle$ is well defined. We can compute that

$$\langle \tilde{N}^i, S \rangle_t = \int_0^t S_{s-} (\mu^i_s - 1) \, d\langle \tilde{N}^i, \tilde{N}^i \rangle_s = \int_0^t S_{s-} (\mu^i_s - 1) \, ds,$$

so we get from Girsanov's theorem, see [JS03, Lemma III.3.14], that

$$\tilde{N}^i_t - \int_0^t \frac{1}{S_{s-}} \, d\langle \tilde{N}^i, S \rangle_s = \tilde{N}^i_t - \int_0^t (\mu^i_s - 1) \, ds = M^{(i)}_t$$

is a local $P$-martingale with respect to $\{\mathcal{F}_t\}_t$ for every $i \leq n$.
Now

$$M^{(i)}_t - M^i_t = \int_0^t \lambda^i_s - \mu^i_s \, ds$$

defines a continuous finite variation $P$-martingale, so $\mu^i = \lambda^i$ $P$ a.s. a.e. and
$S = \mathcal{E}(K)$ where $K_t := \sum_{i=1}^n \int_0^t (\lambda^i_s - 1) \, d\tilde{N}^i_s$ and $\mathcal{E}$ is the stochastic exponential.
Let $K^C_t := \int_0^t \lambda^C_s - 1 \, d\tilde{N}^C_s$ and $K^L_t := K_t - K^C_t$. Since $[K^L, K^C] = 0$ $Q$ a.s., we have
that:

$$(3.3) \quad \mathcal{E}(K) = \mathcal{E}(K^C + K^L) = \mathcal{E}(K^C + K^L + [K^C, K^L]) = \mathcal{E}(K^C) \mathcal{E}(K^L).$$

The last equality follows from [Pro05, Theorem II.38].



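We now consider filtrations corresponding to the $\sigma$-algebras:

$$\mathcal{G}_t := \mathcal{F}^{A,D}_{t \wedge T_D}, \quad \mathcal{G}^C_t := \mathcal{F}^{A,C,D}_{t \wedge T_D}, \quad \mathcal{G}^L_t := \mathcal{F}^{A,D,L}_{t \wedge T_D} \quad \text{and} \quad \mathcal{G}^{C,L}_t := \mathcal{F}_{t \wedge T_D}.$$

Moreover, we let $\bar{\lambda}^D$ denote the $\mathcal{G}_t$-predictable projection of $\lambda^D$, see [JS03,
Theorem I.2.28]. It is the unique predictable process such that

$$\bar{\lambda}^D_S = E[\lambda^D_S \mid \mathcal{G}_{S-}]$$

for every $\mathcal{G}_t$-predictable stopping time $S$.
The local independence relations:

(1) $L \nrightarrow_{T_D} C \mid A \cup D$

(2) $C \nrightarrow_{T_D} L \mid A \cup D$

(3) $C \nrightarrow_{T_D} A \mid L \cup D$

(4) $C \nrightarrow_{T_D} D \mid A \cup L$

and (3.3) provide a factorization

$$\frac{dP|_{\mathcal{G}^{C,L}_t}}{dQ|_{\mathcal{G}^{C,L}_t}} = S^L_t \cdot S^C_t,$$

where $S^L$ is $\mathcal{G}^L_t$-predictable and $S^C$ is $\mathcal{G}^C_t$-predictable. Bayes' theorem now gives
that whenever $F$ is $\mathcal{G}^L_{t-}$-measurable and bounded then

$$(3.4) \quad E[F \mid \mathcal{G}^C_{t-}] = E[F \mid \mathcal{G}_{t-}]$$

$P$ a.s. Therefore, if we let $F$ be bounded and $\mathcal{G}^C_t$-predictable, then we can compute:

$$E\left[ \int_0^T F_s \bar{\lambda}^D_s \, ds \right] = \int_0^T E\left[ F_s E[\lambda^D_s \mid \mathcal{G}_{s-}] \right] ds = \int_0^T E\left[ F_s E[\lambda^D_s \mid \mathcal{G}^C_{s-}] \right] ds = \int_0^T E\left[ E[F_s \lambda^D_s \mid \mathcal{G}^C_{s-}] \right] ds = E\left[ \int_0^T F_s \lambda^D_s \, ds \right].$$

If we let $\bar{M}^D_t = D_t - \int_0^t \bar{\lambda}^D_s \, ds$ and $M^D_t = D_t - \int_0^t \lambda^D_s \, ds$, then we can compute:

$$E\left[ \int_0^T F_s \, d\bar{M}^D_s \right] = E\left[ \int_0^T F_s \, dD_s \right] - E\left[ \int_0^T F_s \bar{\lambda}^D_s \, ds \right] = E\left[ \int_0^T F_s \, dD_s \right] - E\left[ \int_0^T F_s \lambda^D_s \, ds \right] = E\left[ \int_0^T F_s \, dM^D_s \right] = 0,$$

so $\bar{M}^D$ is a $P$-martingale with respect to the filtration $\{\mathcal{G}^C_t\}_t$. Now, since $\mathcal{G}_t \subset \mathcal{G}^C_t$,
we have that $\bar{\lambda}^D$ is $\mathcal{G}^C_t$-predictable:

$$\bar{M}^D_t = E[\bar{M}^D_T \mid \mathcal{G}^C_t] = E[\bar{M}^D_T I(T_D \leq t) + \bar{M}^D_T I(T_D > t) \mid \mathcal{G}^C_t]$$
$$= \bar{M}^D_T I(T_D \leq t) + E[\bar{M}^D_T I(T_D > t) \mid \mathcal{G}^C_t]$$
$$= \bar{M}^D_T I(T_D \leq t) + E[\bar{M}^D_T \mid \mathcal{F}^{A,C,D}_t] I(T_D > t)$$
$$= E[\bar{M}^D_T \mid \mathcal{F}^{A,C,D}_t],$$

i.e. $\bar{M}^D$ is a $P$-martingale with respect to the filtration $\{\mathcal{F}^{A,C,D}_t\}_t$.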
Finally, we note that

$$(3.5) \quad N^A_{t \wedge T_D} - \int_0^t \lambda^A_s I(s \leq T_D) \, ds$$

defines a $P$-martingale with respect to $\mathcal{F}_t$. Since $\lambda^A$ is $\mathcal{F}^{A,D}_t$-predictable, we also
see that (3.5) defines a martingale with respect to $\{\mathcal{F}^{A,C,D}_t\}_t$. □



4. Existence of randomized trial measures 

We have now come to the construction of randomized trial measures. The idea
is to construct a reasonable randomized trial measure $\tilde{P}$ from the observational
measure $P$ such that $\tilde{P} \ll P$. The absolute continuity is important since this
provides a natural method for simulating the empirical expectation of random variables
as if the data was sampled from the counterfactual trial, while actually using
$P$-distributed samples. To get an idea of how this is done, let $J \in \mathbb{N}$, let $H$ be
a bounded random variable and let $\omega_1, \dots, \omega_J$ be $J$ independently $P$-distributed
samples from $\Omega$. The law of large numbers then yields:

$$\lim_{J \to \infty} \frac{1}{J} \sum_{j=1}^J \frac{d\tilde{P}}{dP}(\omega_j) H(\omega_j) = E_P\left[ \frac{d\tilde{P}}{dP} H \right] = E_{\tilde{P}}[H], \quad P \text{ a.s.}$$

Heuristically, this means that the likelihood ratio can be viewed as a transformation
from the observational study into the counterfactual scenario.
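
The weighting principle can be checked numerically on a toy example. The sketch below is not the patient model of Section 3; it assumes, purely for illustration, that a single event time is exponentially distributed with rate `a` under the observational measure and with rate `b` under the counterfactual measure, so that the likelihood ratio is known in closed form. All names in the code are hypothetical.

```python
import math
import random


def weighted_mean(samples, weight, h):
    """Empirical mean of weight(omega) * h(omega) over the samples."""
    return sum(weight(w) * h(w) for w in samples) / len(samples)


a, b = 1.0, 1.5  # assumed rates under P and under the counterfactual measure

rng = random.Random(1)
samples = [rng.expovariate(a) for _ in range(200_000)]  # P-distributed


def likelihood_ratio(t):
    """Density ratio of Exp(b) against Exp(a) at t; bounded since b > a."""
    return (b / a) * math.exp(-(b - a) * t)


def h(t):
    """A bounded random variable: the event time truncated at 1."""
    return min(t, 1.0)


estimate = weighted_mean(samples, likelihood_ratio, h)
exact = (1.0 - math.exp(-b)) / b  # E[min(T, 1)] for T ~ Exp(b)
mean_weight = weighted_mean(samples, likelihood_ratio, lambda t: 1.0)
```

With this many samples, `estimate` should be close to `exact`, and `mean_weight` should be close to 1, as expected for a likelihood ratio.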

There might exist several reasonable counterfactual trials, each corresponding to
a choice of well-behaved treatment and censoring strategies. Given a non-negative
$\mathcal{F}^{A,D}_t$-predictable process $\tilde{\lambda}^A$ and a non-negative $\mathcal{F}^{A,C,D}_t$-predictable process $\tilde{\lambda}^C$, we
can consider the problem of finding a randomized trial measure $\tilde{P}$ that has $\tilde{\lambda}^A$ as
the $\mathcal{F}_t$-intensity of $N^A$ and $\tilde{\lambda}^C$ as the $\mathcal{F}_t$-intensity of $N^C$. This suggests saying
that $\tilde{\lambda}^A$ is the treatment strategy and $\tilde{\lambda}^C$ is the censoring strategy in the counterfactual
trial. We will consider the counterfactual treatment strategy given by
the $P$-intensity of $N^A$ with respect to $\mathcal{F}^{A,D}_t$. The counterfactual censoring strategy
will be given by the $P$-intensity of $N^C$ with respect to $\mathcal{F}^{A,C,D}_t$. This gives
a randomized trial measure with a likelihood ratio that heuristically corresponds
to the stabilized weights one usually considers in the discrete time marginal structural
models [RHB00]. The problem of finding such a randomized trial measure
is a martingale problem. Note that this problem might not have a solution. The
next theorem shows that if the counterfactual strategies are not too different from
the observed intensities, then there exists a unique corresponding randomized trial
measure $\tilde{P}$.

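Theorem 2. Suppose that there exist positive numbers $\theta_1$ and $\theta_2$ such that:

$$(4.1) \quad \lambda^A_s - \theta_1 \sqrt{\lambda^A_s} \leq E_P[\lambda^A_s \mid \mathcal{F}^{A,D}_{s-}] \leq \lambda^A_s + \theta_1 \sqrt{\lambda^A_s}$$

and

$$(4.2) \quad \lambda^C_s - \theta_2 \sqrt{\lambda^C_s} \leq E_P[\lambda^C_s \mid \mathcal{F}^{A,C,D}_{s-}] \leq \lambda^C_s + \theta_2 \sqrt{\lambda^C_s}$$

for almost every $s$ $P$ a.s. Let $\tilde{\lambda}^A$ denote the $P$-intensity of $N^A$ with respect to
the filtration $\{\mathcal{F}^{A,D}_t\}_t$ and let $\tilde{\lambda}^C$ denote the $P$-intensity of $N^C$ with respect to the
filtration $\{\mathcal{F}^{A,C,D}_t\}_t$.
The equation

$$(4.3) \quad R_t := \prod_{s \leq t} \left( \frac{\tilde{\lambda}^A_s}{\lambda^A_s} \right)^{\Delta N^A_s} \exp\left( \int_0^t \lambda^A_s - \tilde{\lambda}^A_s \, ds \right) \cdot \prod_{s \leq t} \left( \frac{\tilde{\lambda}^C_s}{\lambda^C_s} \right)^{\Delta N^C_s} \exp\left( \int_0^t \lambda^C_s - \tilde{\lambda}^C_s \, ds \right)$$

defines a square integrable $P$-martingale with respect to the filtration $\{\mathcal{F}_t\}_t$. Moreover,

$$d\tilde{P} = R_T \, dP$$

defines a randomized trial measure on $(\Omega, \mathcal{F})$ such that the martingale dynamics of
the biological processes $D$ and $L$ coincide for the two probability measures $P$ and
$\tilde{P}$, i.e.

$$L_t - \int_0^t H^L_s \lambda^L_s \, ds \quad \text{and} \quad D_t - \int_0^t \lambda^D_s I(s \leq T_D) \, ds$$

also define $\tilde{P}$-martingales with respect to $\{\mathcal{F}_t\}_t$.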
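Proof. We define

$$K_t := \int_0^t \left( \frac{\tilde{\lambda}^A_s}{\lambda^A_s} - 1 \right) dM^A_s + \int_0^t \left( \frac{\tilde{\lambda}^C_s}{\lambda^C_s} - 1 \right) dM^C_s.$$

By the innovation theorem we have that

$$(4.4) \quad \tilde{\lambda}^A_s = E_P[\lambda^A_s \mid \mathcal{F}^{A,D}_{s-}] \quad \text{and} \quad \tilde{\lambda}^C_s = E_P[\lambda^C_s \mid \mathcal{F}^{A,C,D}_{s-}] \quad P \text{ a.s., } s \text{ a.e.}$$

By (4.1) and (4.2) we have that

$$\left| \frac{\tilde{\lambda}^A_s}{\lambda^A_s} - 1 \right| \leq \frac{\theta_1}{\sqrt{\lambda^A_s}} \quad \text{and} \quad \left| \frac{\tilde{\lambda}^C_s}{\lambda^C_s} - 1 \right| \leq \frac{\theta_2}{\sqrt{\lambda^C_s}}.$$

We therefore obtain that:

$$\langle K, K \rangle_t = \int_0^t \left( \frac{\tilde{\lambda}^A_s}{\lambda^A_s} - 1 \right)^2 \lambda^A_s + \left( \frac{\tilde{\lambda}^C_s}{\lambda^C_s} - 1 \right)^2 \lambda^C_s \, ds \leq (\theta_1^2 + \theta_2^2) \cdot t.$$

Since $\langle K, K \rangle$ is bounded on the interval $[0, T]$, [LM78, Theorem II.1] yields that
the stochastic exponential $R := \mathcal{E}(K)$, given by the SDE:

$$(4.5) \quad R_t = 1 + \int_0^t R_{s-} \, dK_s,$$

is square integrable.
Now

$$d\tilde{P} = R_T \, dP$$

defines a probability measure $\tilde{P}$ on $(\Omega, \mathcal{F})$.
Note that we have:

$$M^D - \int_0^{\cdot} \frac{1}{R_{s-}} \, d\langle M^D, R \rangle_s = M^D \quad \text{and} \quad M^l - \int_0^{\cdot} \frac{1}{R_{s-}} \, d\langle M^l, R \rangle_s = M^l$$

$P$ a.s. for every $l \in L$. Moreover, Girsanov's theorem, [JS03, Lemma III.3.14], gives
that these processes define local $\tilde{P}$-martingales with respect to the filtration $\{\mathcal{F}_t\}_t$.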
Moreover,

$$\langle M^A, R \rangle_t = \int_0^t R_{s-} \left( \frac{\tilde{\lambda}^A_s}{\lambda^A_s} - 1 \right) d\langle M^A, M^A \rangle_s = \int_0^t R_{s-} (\tilde{\lambda}^A_s - \lambda^A_s) \, ds,$$

so again by Girsanov's theorem, we have that

$$M^A_t - \int_0^t \frac{1}{R_{s-}} \, d\langle M^A, R \rangle_s = M^A_t - \int_0^t \tilde{\lambda}^A_s - \lambda^A_s \, ds = N^A_t - \int_0^t \tilde{\lambda}^A_s \, ds$$

defines a $\tilde{P}$-martingale with respect to the filtration $\{\mathcal{F}_t\}_t$. Analogously, we have
that

$$M^C_t - \int_0^t \frac{1}{R_{s-}} \, d\langle M^C, R \rangle_s = N^C_t - \int_0^t \tilde{\lambda}^C_s \, ds$$

defines a local $\tilde{P}$-martingale with respect to the filtration $\{\mathcal{F}_t\}_t$.

Finally, we note that by [JS03, Theorem I.4.60], the SDE (4.5) has the explicit
solution given by (4.3). Expressions of this form are well known in the literature
on marked point processes, see for instance [Jac75]. □
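
The explicit formula (4.3) can be evaluated along a single observed path once the intensities have been computed for that path. The following is a minimal sketch under the assumption that, for the given patient, the observational and counterfactual intensities are available as functions of time and are strictly positive at the jump times. The function names, the midpoint rule and the numerical values are illustrative choices and not part of the construction above.

```python
import math


def intensity_ratio_factor(jump_times, lam, lam_tilde, t, steps=1000):
    """One factor of (4.3) for a single counting process.

    Computes prod_{s <= t} (lam_tilde(s) / lam(s))^{dN_s}
    times exp(int_0^t lam(s) - lam_tilde(s) ds), where the integral is
    approximated with the midpoint rule on `steps` subintervals.
    """
    log_factor = 0.0
    for s in jump_times:
        if s <= t:
            log_factor += math.log(lam_tilde(s) / lam(s))
    h = t / steps
    integral = 0.0
    for k in range(steps):
        mid = h * (k + 0.5)
        integral += (lam(mid) - lam_tilde(mid)) * h
    return math.exp(log_factor + integral)


def weight(t, treatment_jumps, censoring_jumps,
           lam_a, lam_a_tilde, lam_c, lam_c_tilde):
    """Product of the treatment factor and the censoring factor, as in (4.3)."""
    return (intensity_ratio_factor(treatment_jumps, lam_a, lam_a_tilde, t)
            * intensity_ratio_factor(censoring_jumps, lam_c, lam_c_tilde, t))


# Hypothetical patient: treatment starts at time 0.5, no censoring before t = 1.
r = weight(
    1.0,
    treatment_jumps=[0.5],
    censoring_jumps=[],
    lam_a=lambda s: 2.0,        # assumed observational treatment intensity
    lam_a_tilde=lambda s: 1.0,  # assumed counterfactual treatment intensity
    lam_c=lambda s: 0.3,
    lam_c_tilde=lambda s: 0.3,  # equal censoring intensities give factor 1
)
```

With constant intensities the treatment factor reduces to $(1/2) \cdot e^{(2 - 1) \cdot 1}$, so `r` equals $e/2$ up to floating point error.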

Remark 1. Note that the condition (4.1) heuristically means that the short-term
"risk" of starting treatment, given the previous full history,

$$\lim_{h \to 0} h^{-1} P(t \leq T_A < t + h \mid \mathcal{F}_{t-}),$$

is not too different from the short-term "risk" of starting treatment, when we do
not pay attention to the underlying health process or censoring, i.e.

$$\lim_{h \to 0} h^{-1} P(t \leq T_A < t + h \mid \mathcal{F}^{A,D}_{t-}).$$

Similarly, the condition (4.2) heuristically means that the short-term "risk" of
being censored, given the previous full history, is not too different from the short-term
"risk" of being censored, when we do not pay attention to the underlying health
process. In words, this means that the previous history of the health process $L$ alone
cannot at any time yield too high a short-term "risk" of starting treatment or being
censored.

5. Weighted additive hazard regression 

Suppose that we have observations of $m$ independent individuals until death
or censoring from an observational study and want to estimate the total effect of
treatment. Ideally, we would like to base our estimate on a randomized trial.
However, such a trial might not be available. The marginal structural approach
instead suggests simulating a counterfactual randomized trial using the data we already
have. We would then like to estimate the counterfactual hazard, i.e. the hazard
that the patient would have had if he, contrary to fact, had participated in the
randomized trial. In this way we could compare the total effect of being on treatment
with that of never having been on treatment.

We assume that the observations from each individual are $P$-distributed. The
observations of the patient would have been $\tilde P$-distributed if he, contrary to
fact, had participated in the hypothetical randomized trial.

We assume that the counterfactual intensity follows Aalen's additive hazard
regression model, see [ABGK93] and [ABG08]. To formalize this, let $\beta^0$ and $\beta^1$ be
functions on $[0,T]$. We assume that treatment has only an instantaneous effect and
that the hazard for the event, with respect to the treatment and event history, is
given by:

\[ \beta^0_t + \beta^1_t A_{t-}. \]

One way to estimate the hazard from the counterfactual trial is to weight the
observations by the corresponding likelihood ratios. We will prove that a suitable
weighted variant of Aalen's additive hazard regression gives consistent estimators
of $\int_0^t \beta^0_s\,ds$ and $\int_0^t \beta^1_s\,ds$ from independent $P$-distributed observations. This requires
some notation. Let $Y^1_t, \dots, Y^m_t$ be the "at-risk" indicators and $A^1_t, \dots, A^m_t$ be the
"at-treatment" indicators for the $m$ independent individuals. We define the $m \times 2$-matrix:

\[ X^{(m)}_t := \begin{pmatrix} Y^1_t & Y^1_t A^1_{t-} \\ \vdots & \vdots \\ Y^m_t & Y^m_t A^m_{t-} \end{pmatrix}. \]

Moreover, let $R^1_t, \dots, R^m_t$ be the individual likelihood ratios at time $t$ and let

\[ R^{(m)}_{t-} := \begin{pmatrix} Y^1_t R^1_{t-} & & & 0 \\ & Y^2_t R^2_{t-} & & \\ & & \ddots & \\ 0 & & & Y^m_t R^m_{t-} \end{pmatrix}. \]

Finally, let $D^1_t, \dots, D^m_t$ be the event processes for the $m$ individuals before $t$. The
observed events are now given by the vector

\[ D^{(m)}_t := \begin{pmatrix} \int_0^t Y^1_s\,dD^1_s \\ \vdots \\ \int_0^t Y^m_s\,dD^m_s \end{pmatrix}. \]



Theorem 3. We assume that

(1) The $P$-intensity of $D$ with respect to the filtration $\mathcal F_t$, $\lambda^D$, is dominated by
an integrable function $G$,

(2) (Positivity) Both "at-risk" groups in the counterfactual trial are always
present, i.e.

\[ E_{\tilde P}[Y_s A_{s-}] > 0 \quad\text{and}\quad E_{\tilde P}[Y_s (1 - A_{s-})] > 0 \]

for every $s \in [0,T]$,

(3) There exist integrable and left continuous functions with right limits $\beta^0$ and
$\beta^1$ such that $\beta = (\beta^0, \beta^1)^T$ and such that

\[ D^{(m)}_t - \int_0^t X^{(m)}_s \beta_s\,ds \]

is a $\tilde P$-martingale with respect to the filtration $\mathcal F^{A,C,D}_t$, i.e. $Y_t(\beta^0_t + \beta^1_t A_{t-})$
is the $\tilde P$-intensity of $D$ w.r.t. the filtration $\mathcal F^{A,C,D}_t$.

We let

\[ J^{(m)}_t := I\Big(\sum_{i=1}^m R^i_{t-} Y^i_t (1 - A^i_{t-}) > 0 \text{ and } \sum_{i=1}^m R^i_{t-} Y^i_t A^i_{t-} > 0\Big), \]

\[ B^{(m)}_t := \int_0^t J^{(m)}_s \big(X^{(m)T}_s R^{(m)}_{s-} X^{(m)}_s\big)^{-1} X^{(m)T}_s R^{(m)}_{s-}\,dD^{(m)}_s \]

and

\[ B_t := \int_0^t \beta_s\,ds. \]

Now $B^{(m)}$ is a consistent estimator of $B$ in the sense that:

\[ \lim_m P\big(d(B^{(m)}, B) > \epsilon\big) = 0 \]

for every $\epsilon > 0$, where $d$ denotes the Skorohod metric, see [JS03] or [Bil99].
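
To make the estimator concrete, here is a minimal Python sketch of a single increment of the weighted estimator at one event time, i.e. the quantity J (X^T R X)^{-1} X^T R dD. The function name, the argument layout and the toy numbers are assumptions of this illustration; in particular the likelihood-ratio weights are simply taken as given.

```python
import numpy as np

def weighted_aalen_increment(Y, A, R, dD):
    """One increment J (X^T R X)^{-1} X^T R dD of the weighted estimator.

    Y, A: at-risk and at-treatment indicators just before the event time.
    R: individual likelihood-ratio weights just before the event time.
    dD: event increments (1 for an individual with an event, else 0).
    """
    Y, A, R, dD = (np.asarray(v, dtype=float) for v in (Y, A, R, dD))
    X = np.column_stack([Y, Y * A])      # m x 2 design matrix
    W = np.diag(Y * R)                   # diagonal weight matrix
    # J = 1 only when both weighted risk groups are non-empty.
    untreated = np.sum(R * Y * (1.0 - A))
    treated = np.sum(R * Y * A)
    if untreated <= 0.0 or treated <= 0.0:
        return np.zeros(2)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ (Y * dD))

# Four individuals at risk, two untreated and two treated (toy numbers).
Y = [1, 1, 1, 1]
A = [0, 0, 1, 1]
R = [1.0, 2.0, 1.0, 3.0]
dD = [0, 1, 0, 0]                        # the second individual has an event
increment = weighted_aalen_increment(Y, A, R, dD)
print(increment)
```

Summing such increments over the observed event times up to t gives the estimate of the cumulative coefficients. In the toy example the event occurs in the untreated group, so the first component is the weight of that individual divided by the total weight of the untreated at-risk group, and the second component is its negative.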
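Proof. We define:

\[ \lambda^{(m)}_t := \begin{pmatrix} Y^1_t \lambda^{D,1}_t \\ \vdots \\ Y^m_t \lambda^{D,m}_t \end{pmatrix} \quad\text{and}\quad M^{(m)}_t := D^{(m)}_t - \int_0^t \lambda^{(m)}_s\,ds, \]

where $\lambda^{D,i}$ denotes the intensity of $D^i$ with respect to $\mathcal F_t$.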

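We will often drop the index $(m)$ in order to simplify the notation. Another
simplification of the notation we will use is $E[\cdot]$ for the expectation with respect to $P$
and $\tilde E[\cdot]$ for the expectation with respect to $\tilde P$.
First we prove that

\[ \lim_m P\Big(\sup_{t \le T} \Big| \int_0^t J_s (X_s^T R_{s-} X_s)^{-1} X_s^T R_{s-} \lambda_s\,ds - B_t \Big| > \epsilon\Big) = 0 \tag{5.1} \]

for every $\epsilon > 0$. Define:

\[ V := \begin{pmatrix} 1 & 0 \\ -1 & 1 \end{pmatrix} \quad\text{and}\quad S_t := \begin{pmatrix} \sum_{i=1}^m R^i_{t-} Y^i_t (1 - A^i_{t-}) & 0 \\ 0 & \sum_{i=1}^m R^i_{t-} Y^i_t A^i_{t-} \end{pmatrix}. \]

The matrix $V$ is invertible and, using the fact that the $Y^i$ and the $A^i$ are indicators,
we have:

\[ V^T X_t^T R_{t-} X_t V = S_t, \]

i.e. $X_t^T R_{t-} X_t$ is congruent to the diagonal matrix $S_t$. A simple matrix computation
gives that

\[ (X_t^T R_{t-} X_t)^{-1} = V S_t^{-1} V^T \]

when $J_t > 0$.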
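
The congruence and the resulting inverse formula can be checked numerically. The sketch below assumes V = [[1, 0], [-1, 1]] and S = diag(sum of R Y (1 - A), sum of R Y A), which is our reading of the definitions above; the indicators and weights are randomly generated toy data, not output of any model in this paper.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 50
Y = rng.integers(0, 2, size=m).astype(float)   # at-risk indicators
A = rng.integers(0, 2, size=m).astype(float)   # at-treatment indicators
# Force one untreated and one treated individual at risk, so J = 1.
Y[0], A[0] = 1.0, 0.0
Y[1], A[1] = 1.0, 1.0
R = rng.uniform(0.5, 2.0, size=m)              # hypothetical positive weights

X = np.column_stack([Y, Y * A])                # m x 2 design matrix
W = np.diag(Y * R)                             # diagonal weight matrix
V = np.array([[1.0, 0.0], [-1.0, 1.0]])
S = np.diag([np.sum(R * Y * (1 - A)), np.sum(R * Y * A)])

gram = X.T @ W @ X
# Congruence: V^T (X^T R X) V should equal the diagonal matrix S.
congruent = np.allclose(V.T @ gram @ V, S)
# Inverse formula: (X^T R X)^{-1} should equal V S^{-1} V^T.
inverse_ok = np.allclose(np.linalg.inv(gram), V @ np.linalg.inv(S) @ V.T)
print(congruent, inverse_ok)
```

The practical point of the identity is that the 2 x 2 inverse never has to be computed numerically: it only involves the two weighted group sizes on the diagonal of S.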

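Now, we see that:

\[ J_t (X_t^T R_{t-} X_t)^{-1} X_t^T R_{t-} \lambda_t = J_t V H_t, \]

where $H_t := (H^0_t, H^1_t)^T$ with

\[ H^0_t := \frac{\sum_{i=1}^m R^i_{t-} Y^i_t (1 - A^i_{t-}) \lambda^{D,i}_t}{\sum_{i=1}^m R^i_{t-} Y^i_t (1 - A^i_{t-})} \quad\text{and}\quad H^1_t := \frac{\sum_{i=1}^m R^i_{t-} Y^i_t A^i_{t-} \lambda^{D,i}_t}{\sum_{i=1}^m R^i_{t-} Y^i_t A^i_{t-}}. \]

Since $\tilde E[Y_t (1 - A_{t-})] > 0$, the law of large numbers implies that $H^0_t$ converges in
probability to

\[ \frac{\tilde E[Y_t (1 - A_{t-}) \lambda^D_t]}{\tilde E[Y_t (1 - A_{t-})]} = \tilde E[\lambda^D_t \mid Y_t (1 - A_{t-}) = 1]. \]

Analogously, since $\tilde E[Y_t A_{t-}] > 0$, we have that $H^1_t$ converges in probability to

\[ \frac{\tilde E[Y_t A_{t-} \lambda^D_t]}{\tilde E[Y_t A_{t-}]} = \tilde E[\lambda^D_t \mid Y_t A_{t-} = 1]. \]

By a similar argument, we see that $\{J^{(m)}_s\}_m$ converges in probability to 1 for almost
every $s$.

Since the $\tilde P$-intensity of $D$ with respect to $\mathcal F^{A,C,D}_t$ coincides with $\tilde E[\lambda^D_s \mid \mathcal F^{A,C,D}_{s-}]$
$\tilde P$ a.s. for almost every $s$, we have that:

\[ \tilde E[\lambda^D_t \mid Y_t (1 - A_{t-}) = 1] = \beta^0_t \quad\text{and}\quad \tilde E[\lambda^D_t \mid Y_t A_{t-} = 1] = \beta^0_t + \beta^1_t. \]

This means that

\[ J_t (X_t^T R_{t-} X_t)^{-1} X_t^T R_{t-} \lambda_t \tag{5.2} \]

converges in probability to $\beta_t$ when $m$ increases. Note that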



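\[ E\Big[\sup_{t \le T} \Big| \int_0^t \big(J_s (X_s^T R_{s-} X_s)^{-1} X_s^T R_{s-} \lambda_s - \beta_s\big)\,ds \Big|\Big] \le \int_0^T E\big[\big| J_s (X_s^T R_{s-} X_s)^{-1} X_s^T R_{s-} \lambda_s - \beta_s \big|\big]\,ds, \]

so by the dominated convergence theorem, we obtain (5.1).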
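We will now prove that

\[ Z_t := B^{(m)}_t - \int_0^t J_s (X_s^T R_{s-} X_s)^{-1} X_s^T R_{s-} \lambda_s\,ds = \int_0^t J_s (X_s^T R_{s-} X_s)^{-1} X_s^T R_{s-}\,dM_s \]

converges weakly to 0. Note that:

\[ \langle Z, Z\rangle_t = \int_0^t J_s (X_s^T R_{s-} X_s)^{-1} X_s^T R_{s-}\,d\langle M, M\rangle_s\,R_{s-} X_s (X_s^T R_{s-} X_s)^{-1} \]
\[ = \int_0^t J_s V S_s^{-1} V^T X_s^T R_{s-}\,d\langle M, M\rangle_s\,R_{s-} X_s V S_s^{-1} V^T \]
\[ = V \int_0^t J_s S_s^{-2} U_s\,ds\,V^T, \]

where

\[ U_s := \begin{pmatrix} \sum_{i=1}^m (R^i_{s-})^2 Y^i_s (1 - A^i_{s-}) \lambda^{D,i}_s & 0 \\ 0 & \sum_{i=1}^m (R^i_{s-})^2 Y^i_s A^i_{s-} \lambda^{D,i}_s \end{pmatrix}. \]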

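Now,

\[ E\Big[\sup_{t \le T} \Big| \int_0^t V J_s S_s^{-2} U_s V^T\,ds \Big|\Big] \le \int_0^T E\big[\big| V J_s S_s^{-2} U_s V^T \big|\big]\,ds, \]

so, by the dominated convergence theorem, $\{\langle Z^{(m)}, Z^{(m)}\rangle\}_m$ converges uniformly in
probability to 0.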
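We define

\[ Z^{(\epsilon,m)}_t := \int_0^t I\big(\big| J_s (X_s^T R_{s-} X_s)^{-1} X_s^T R_{s-} Y_s \big| > \epsilon\big)\,J_s (X_s^T R_{s-} X_s)^{-1} X_s^T R_{s-}\,dM_s \]

and see that

\[ 0 \le \langle Z^{(\epsilon,m)}, Z^{(\epsilon,m)}\rangle_t \le \langle Z^{(m)}, Z^{(m)}\rangle_t. \]

Since both $\{\langle Z^{(\epsilon,m)}, Z^{(\epsilon,m)}\rangle\}_m$ and $\{\langle Z^{(m)}, Z^{(m)}\rangle\}_m$
converge uniformly in probability to 0, the central limit theorem for martingales,
[JS03, Theorem VIII 3.22], implies that $\{Z^{(m)}\}_m$ converges weakly to 0.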
We have that

\[ B^{(m)}_t = Z^{(m)}_t + \int_0^t J_s (X_s^T R_{s-} X_s)^{-1} X_s^T R_{s-} \lambda_s\,ds, \]

so the sequence $\{B^{(m)}\}_m$ is the sum of two C-tight sequences. [JS03, Corollary VI
3.33] implies that the sequence itself is then also C-tight. By Slutsky's theorem,
the finite dimensional distributions of the form $\mathcal L(B^{(m)}_{t_1}, \dots, B^{(m)}_{t_k})$ converge weakly
to the Dirac measures $\delta_{B_{t_1}, \dots, B_{t_k}}$. This means that $\{B^{(m)}\}_m$ converges in law and
therefore in probability to $B$. □

6. Concluding remarks

We have shown that marginal structural modeling can be understood in terms 
of change of probability measures. The author believes that this is an elucidating 
point of view that is natural in the framework of modern probability theory. 

As stressed by several authors, there is a very important and highly non-trivial 
assumption one has to make in order to interpret effects from marginal structural 
models as causal. This is the assumption of no unmeasured confounders, or equiv- 
alently: all confounders are measured. This means that every process that affects 
the short-term behavior of both the treatment and the censoring or both the treat- 
ment and the event must be observed. In this equivalent form, it becomes more 
apparent that this is just an assumption about completeness of the model. This 
completeness assumption is not that mysterious. When modeling various phenom- 
ena in the natural sciences one typically assumes that all the important variables 
are contained in the model. This is also necessary in the MSM approach. However, 
it is important to note that this is not in general a statistically testable assumption.
Nor is it a condition that would follow from a mathematical argument without
further assumptions about the model.

Heuristically, the MSM approach provides an adjustment of the treatment
effect bias caused by the measured confounders. What one essentially does in the
marginal structural model approach is to model a randomized trial instead of the
underlying, and potentially very complicated, biology. The problem of computing
marginal effects then splits into two parts. The first problem
is to model the marginal intensity of the event in the simulated "randomized trial".
If one knew the corresponding likelihood ratio process, then this would be obtainable
using for instance the weighted additive hazard regression from the previous
section. In order to compute this likelihood ratio one has to deal with the second
part of the problem, which is to model the dynamics of the treatment and censoring
processes given the full and the marginal history in the observational study. This
is a crucial point. We have chosen not to deal with this problem in the current
paper. However, one could use regression techniques to do this at least approximately.
In the discrete time setting one typically uses pooled logistic regressions,
see [SHL+05] and [HBR00]. In the continuous time setting it is probably more
natural to use additive hazard or Poisson regression to estimate the censoring and
treatment intensities, both with respect to the full covariate history and the marginal
covariate history. This will be the topic of future work. Once these intensities are
known, one can compute the likelihood ratio process using (4.3).